R Final Presentation: Analysis of Okazaki sequencing Data.

8/13/2020

Background

Where does DNA replication start and stop in the human genome? Okazaki sequencing (Ok-seq) is a method that can be used to infer the direction of replication-fork movement, and therefore regions of increased replication initiation and termination in the genome.

Okazaki fragments are short seqeunces of DNA that are synthesized discontinuously and then later joined together by DNA ligase to form the “lagging strand” during DNA replication.

The DNA template can only be read in 3′ to 5′ direction. (a new strand is synthesized in the 5′ to 3′ direction)

David O Morgan, en.wikipedia.org/wiki/Okazaki_fragments

Data Source

The data table is publicly available here: https://github.com/FenyoLab/Ok-Seq_Processing

It is associated with this paper: Chen, Y. H., Keegan, S., Kahli, M., Tonzi, P., Fenyö, D., Huang, T. T., & Smith, D. J. (2019). Transcription shapes DNA replication initiation and termination in human cells. Nature structural & molecular biology, 26(1), 67–77. https://doi.org/10.1038/s41594-018-0171-0

The cartoons shown to explain Okazaki Sequencing and the strand bias plots are taken from this paper.

Step 1: Load the data.

The data table is a csv file that contains data from Okazaki Sequencing (Ok-Seq) of human RPE-1 (untransformed) cells. Ok-Seq data consists of a read count on the Watson strand and a read count on the Crick strand of DNA. By viewing how the proportion of reads on the W vs. C strand changes as we move across the genome, we are able to determine regions of replication initiation (replication forks) and replication termination.

The type of visualization I will use for this data we call a “strand bias plot”. Percentage of Crick strand reads C/(W+C) are plotted from left to right. An increase in Crick strand reads indicates replication initiation.

Larger increase -> higher “efficiency”, i.e. larger percentage of cells have a replication fork at that location Slope of increase -> localization, i.e. steeper slope indicates localized origin firing vs. more gradual slope indicates de-localized (forks fire across a wider range)

This data has been collated into 1kb bins from 100kb upstream to 100kb downstream of the Transcription Start Sites (TSS) and Transcription Termination Sites (TTS) of all human protein coding genes. For example, for each gene TSS, there are 201 columns of Watson strand read data: 100kb upstream + TSS site + 100kb downstream and 201 columns of Crick strand read data.

This table also contains FPKM levels for each gene taken from RNA-Seq of RPE-1 cells (GSE89413). Harenza JL, Diamond MA, Adams RN, Song MM et al. Transcriptomic profiling of 39 commonly-used neuroblastoma cell lines. Sci Data 2017 Mar 28;4:170033. PMID: 28350380

gene_table=read.csv("sites_table_with_distributions-rpe_edu_2.csv", header=TRUE)
class(gene_table)
## [1] "data.frame"
nrow(gene_table)
## [1] 18250
ncol(gene_table)
## [1] 835
colnames(gene_table[,1:50])
##  [1] "X"                              "Unnamed..0"                    
##  [3] "X.hg19.knownGene.name"          "hg19.knownGene.chrom"          
##  [5] "hg19.knownGene.strand"          "hg19.knownGene.txStart"        
##  [7] "hg19.knownGene.txEnd"           "hg19.knownGene.cdsStart"       
##  [9] "hg19.knownGene.cdsEnd"          "hg19.knownGene.exonCount"      
## [11] "hg19.knownGene.exonStarts"      "hg19.knownGene.exonEnds"       
## [13] "hg19.knownGene.proteinID"       "hg19.knownGene.alignID"        
## [15] "hg19.kgXref.kgID"               "hg19.kgXref.geneSymbol"        
## [17] "hg19.kgXref.refseq"             "hg19.kgXref.protAcc"           
## [19] "hg19.kgXref.description"        "hg19.knownCanonical.chrom"     
## [21] "hg19.knownCanonical.chromStart" "hg19.knownCanonical.chromEnd"  
## [23] "hg19.knownCanonical.clusterId"  "hg19.knownCanonical.transcript"
## [25] "hg19.knownCanonical.protein"    "hg19.knownIsoforms.clusterId"  
## [27] "hg19.knownIsoforms.transcript"  "RPE1_fpkm"                     
## [29] "fpkm"                           "w_sum_TSS_to_TTS"              
## [31] "c_sum_TSS_to_TTS"               "dist_w_100_from_TSS_1"         
## [33] "dist_w_100_from_TSS_2"          "dist_w_100_from_TSS_3"         
## [35] "dist_w_100_from_TSS_4"          "dist_w_100_from_TSS_5"         
## [37] "dist_w_100_from_TSS_6"          "dist_w_100_from_TSS_7"         
## [39] "dist_w_100_from_TSS_8"          "dist_w_100_from_TSS_9"         
## [41] "dist_w_100_from_TSS_10"         "dist_w_100_from_TSS_11"        
## [43] "dist_w_100_from_TSS_12"         "dist_w_100_from_TSS_13"        
## [45] "dist_w_100_from_TSS_14"         "dist_w_100_from_TSS_15"        
## [47] "dist_w_100_from_TSS_16"         "dist_w_100_from_TSS_17"        
## [49] "dist_w_100_from_TSS_18"         "dist_w_100_from_TSS_19"
31+201*4
## [1] 835

Step 2: Calculate strand bias averaged over all genes using existing columns

What can we learn about replication/transcription conflicts from this data? One thing to do is to examine what the strand bias curves look like near genes. If we see evidence of replication initiation/termination, this would be interesting to explore.

First, I want to make a plot of the strand bias around the TSS. I’ll average this over all genes in the table. I have the Watson and Crick read density around the TSS: columns 32 to 232 should be the W data. And columns 233 to 433 should be the C data. I need to construct a matrix of C and W read data and then calculate the C percentage for each column.

# Check we are selecting the correct columns
colnames(gene_table[,c(32,232,233,433)])
## [1] "dist_w_100_from_TSS_1"   "dist_w_100_from_TSS_201"
## [3] "dist_c_100_from_TSS_1"   "dist_c_100_from_TSS_201"
w_data = as.matrix(gene_table[,32:232])
ave_w_data = colMeans(w_data, na.rm=TRUE)
c_data = as.matrix(gene_table[,233:433])
ave_c_data = colMeans(c_data, na.rm=TRUE)
sb_data = ave_c_data / (ave_c_data + ave_w_data)
sb_data = sb_data*100

Step 3: Make strand bias plots - average over all genes

library(ggplot2)

# I want x-axis to be -100 to +100 around the gene TSS site
to_plot <- as.data.frame(cbind(-100:100, sb_data))

# Plot strand bias round gene TSS, limit to -50/+50, set x,y labels and theme
ggplot(to_plot, aes(x = V1, y=sb_data)) + geom_line() + xlab("Distance to TSS (kb)") + ylab("% of OFs on Crick Strand") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70))

Normalize for transcriptional direction (Genes on Watson and Crick strand) so we can correctly visualize strand bias in the direction of transcription.

Split up by genes on the Watson (plus) strand and genes on the Crick (minus) strand. About half the genes are on the Watson strand and half on the Crick strand.

library(reshape2)
table(gene_table$hg19.knownGene.strand)
## 
##    -    + 
## 9050 9200

Same procedure as in step 2, but separate the data into Watson and Crick strand genes.

w_genes=gene_table[gene_table$hg19.knownGene.strand == "+",]
w_genes_w_data = as.matrix(w_genes[,32:232])
w_genes_ave_w_data = colMeans(w_genes_w_data, na.rm=TRUE)
w_genes_c_data = as.matrix(w_genes[,233:433])
w_genes_ave_c_data = colMeans(w_genes_c_data, na.rm=TRUE)
w_genes_sb_data = w_genes_ave_c_data / (w_genes_ave_c_data + w_genes_ave_w_data)
w_genes_sb_data = w_genes_sb_data*100
w_genes_to_plot <- as.data.frame(cbind(-100:100, w_genes_sb_data))

c_genes=gene_table[gene_table$hg19.knownGene.strand == "-",]
c_genes_w_data = as.matrix(c_genes[,32:232])
c_genes_ave_w_data = colMeans(c_genes_w_data, na.rm=TRUE)
c_genes_c_data = as.matrix(c_genes[,233:433])
c_genes_ave_c_data = colMeans(c_genes_c_data, na.rm=TRUE)
c_genes_sb_data = c_genes_ave_c_data / (c_genes_ave_c_data + c_genes_ave_w_data)
c_genes_sb_data = c_genes_sb_data*100
c_genes_to_plot <- as.data.frame(cbind(-100:100, c_genes_sb_data))

Plot W genes and C genes on the same plot by using Merge/Melt.

all_genes_to_plot = merge(w_genes_to_plot, c_genes_to_plot, by="V1")
head(all_genes_to_plot)
##     V1 w_genes_sb_data c_genes_sb_data
## 1 -100        44.08447        45.19446
## 2  -99        43.56336        44.70118
## 3  -98        43.73958        45.01203
## 4  -97        43.55835        45.26001
## 5  -96        44.09645        45.21501
## 6  -95        44.43530        45.24576
all_genes_to_plotMelted <- reshape2::melt(all_genes_to_plot, id.var='V1')
head(all_genes_to_plotMelted)
##     V1        variable    value
## 1 -100 w_genes_sb_data 44.08447
## 2  -99 w_genes_sb_data 43.56336
## 3  -98 w_genes_sb_data 43.73958
## 4  -97 w_genes_sb_data 43.55835
## 5  -96 w_genes_sb_data 44.09645
## 6  -95 w_genes_sb_data 44.43530
ggplot(all_genes_to_plotMelted, aes(x=V1, y=value, col=variable)) + geom_line() + xlab("Distance to TSS (kb)") + ylab("% of OFs on Crick Strand") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70))

Plot in % of forks moving Left to Right in the direction of Transcription. We need to reverse/flip the data for C genes to get this

c_genes_sb_data_norm = c_genes_ave_w_data / (c_genes_ave_c_data + c_genes_ave_w_data)
c_genes_sb_data_norm=rev(c_genes_sb_data_norm)
c_genes_sb_data_norm = c_genes_sb_data_norm*100
c_genes_to_plot_norm <- as.data.frame(cbind(-100:100, c_genes_sb_data_norm))

all_genes_to_plot = merge(w_genes_to_plot, c_genes_to_plot_norm, by="V1")
all_genes_to_plotMelted <- reshape2::melt(all_genes_to_plot, id.var='V1')
ggplot(all_genes_to_plotMelted, aes(x=V1, y=value, col=variable)) + geom_line() + xlab("Distance to TSS (kb), Txn L->R") + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70))

Average W and C genes for final plot.

all_genes_to_plot[,"AVE"] = apply(all_genes_to_plot[,2:3], 1, mean)
ggplot(all_genes_to_plot, aes(x = V1, y=AVE)) + geom_line() + xlab("Distance to TSS (kb), Txn L->R") + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70))

From this plot we can visualize replication dynamics around TSS site of genes.

sb around TSS

Look at strand bias by transcriptional level (RNA-Seq FPKM values split into 4 quartiles), gene length, and transcriptional volume

Define a function to get the strand bias data (as above) because I want to group genes first and then get this. The first argument is the data frame, and the second argument indicates if we want the data for around the TSS or the TTS.

get_sb_data <- function(df, site) 
{
  if(site == "TSS")
  {
    #print("Processing at TSS")
    w_st=32
    w_end=232
    c_st=233
    c_end=433
  }
  else
  {
    #print("Processing at TTS")
    w_st=434
    w_end=634
    c_st=635
    c_end=835
  }
    w_genes=df[df$hg19.knownGene.strand == "+",]
    w_genes_w_data = as.matrix(w_genes[,w_st:w_end])
    w_genes_ave_w_data = colMeans(w_genes_w_data, na.rm=TRUE)
    w_genes_c_data = as.matrix(w_genes[,c_st:c_end])
    w_genes_ave_c_data = colMeans(w_genes_c_data, na.rm=TRUE)
    w_genes_sb_data = w_genes_ave_c_data / (w_genes_ave_c_data + w_genes_ave_w_data)
    w_genes_sb_data = w_genes_sb_data*100
    w_genes_to_plot <- as.data.frame(cbind(-100:100, w_genes_sb_data))
    
    c_genes=gene_table[gene_table$hg19.knownGene.strand == "-",]
    c_genes_w_data = as.matrix(c_genes[,w_st:w_end])
    c_genes_ave_w_data = colMeans(c_genes_w_data, na.rm=TRUE)
    c_genes_c_data = as.matrix(c_genes[,c_st:c_end])
    c_genes_ave_c_data = colMeans(c_genes_c_data, na.rm=TRUE)
    
    c_genes_sb_data_norm = c_genes_ave_w_data / (c_genes_ave_c_data + c_genes_ave_w_data)
    c_genes_sb_data_norm=rev(c_genes_sb_data_norm)
    c_genes_sb_data_norm = c_genes_sb_data_norm*100
    c_genes_to_plot_norm <- as.data.frame(cbind(-100:100, c_genes_sb_data_norm))
    
    all_genes_to_plot = merge(w_genes_to_plot, c_genes_to_plot_norm, by="V1")
    all_genes_to_plot[,"AVE"] = apply(all_genes_to_plot[,2:3], 1, mean)
    
    return(all_genes_to_plot)
}

Now, I can split up the genes into quartiles based on FPKM from the RNA-Seq data and view the strand bias plot for genes in each quartile. I need to remove NAs first: any genes without FPKM data will be dropped. Then, get the cutoff values for the quartiles and split the data.

library(cowplot)
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
gene_table_filtered <- gene_table[!is.na(gene_table$fpkm),]
nrow(gene_table_filtered)
## [1] 18165
res=quantile(gene_table_filtered$fpkm)
print(res)
##           0%          25%          50%          75%         100% 
## 0.000000e+00 3.031276e-03 1.444829e+00 7.117673e+00 1.618647e+03
q1 = gene_table_filtered[gene_table_filtered$fpkm <= res[2],]
q2 = gene_table_filtered[(gene_table_filtered$fpkm > res[2] & gene_table_filtered$fpkm <= res[3]),]
q3 = gene_table_filtered[gene_table_filtered$fpkm > res[3] & gene_table_filtered$fpkm <= res[4],]
q4 = gene_table_filtered[gene_table_filtered$fpkm > res[4],]
to_plot1 = get_sb_data(q1, "TSS")
to_plot2 = get_sb_data(q2, "TSS")
to_plot3 = get_sb_data(q3, "TSS")
to_plot4 = get_sb_data(q4, "TSS")

q1_p <- ggplot(to_plot1, aes(x = V1, y=AVE)) + geom_line() + xlab("Distance to TSS (kb), Txn L->R") + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70))

q2_p <- ggplot(to_plot2, aes(x = V1, y=AVE)) + geom_line() + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70)) + theme(axis.title.x=element_blank())

q3_p <- ggplot(to_plot3, aes(x = V1, y=AVE)) + geom_line() + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70)) + theme(axis.title.x=element_blank())

q4_p <- ggplot(to_plot4, aes(x = V1, y=AVE)) + geom_line() + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70)) + theme(axis.title.x=element_blank())

Use the “plot_grid” to combine the 4 plots into one figure. Save the plot to png file.

#Combine
pg1 <- plot_grid(q4_p, q3_p, q2_p, q1_p, labels = c("Q4", "Q3", "Q2", "Q1"), nrow=4, ncol=1, label_x=c(0.15,0.15,0.15,0.15), label_y=0.95)

# Add the title
title <- ggdraw() + draw_label("FPKM", fontface = 'bold', x = 0, hjust = 0) + theme(plot.margin = margin(0, 0, 0, 7))
pg2 <- plot_grid(title, pg1, ncol = 1, rel_heights = c(0.1, 1))

# Save
ggsave(file="/Users/sarahkeegan/Dropbox/mac_files/RClass_Summer2020/images/TSS_by_FPKM.png", plot=pg2, width=2.75, height=8, dpi=300)

# Here is code to plot genes by length with FPKM > median (showing how replication initiation increases in efficiency at longer genes, for genes that are transcribed)
# And how to plot genes by transcriptional volume (FPKM*length) : longer genes have more RNA-Pol2 occupying the gene for a given FPKM
# (I generated the plots prior and display them above for purposes of the presentation b/c I didn't want to scroll through this code, which is very similar to above.)

#make a function to create the plots
make_sb_plots_by_quartile <- function(df, site, split_col_pos) 
{ 
  #remove na's from indicated column
  df_filtered <- df[!is.na(df[,split_col_pos]),]
  
  # get cutoffs
  res=quantile(df_filtered[,split_col_pos])
  
  # separate into groups based on cutoffs
  q1 = df_filtered[df_filtered[,split_col_pos] <= res[2],]
  q2 = df_filtered[(df_filtered[,split_col_pos] > res[2] & df_filtered[,split_col_pos] <= res[3]),]
  q3 = df_filtered[df_filtered[,split_col_pos] > res[3] & df_filtered[,split_col_pos] <= res[4],]
  q4 = df_filtered[df_filtered[,split_col_pos] > res[4],]
  
  # get plot data for each group, site is either TSS or TTS
  to_plot1 = get_sb_data(q1, site)
  to_plot2 = get_sb_data(q2, site)
  to_plot3 = get_sb_data(q3, site)
  to_plot4 = get_sb_data(q4, site)
  
  #make the ggplots for each quartile
  q1_p <- ggplot(to_plot1, aes(x = V1, y=AVE)) + geom_line() + xlab("Distance to TSS (kb), Txn L->R") + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70))

  q2_p <- ggplot(to_plot2, aes(x = V1, y=AVE)) + geom_line() + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70)) + theme(axis.title.x=element_blank())
  
  q3_p <- ggplot(to_plot3, aes(x = V1, y=AVE)) + geom_line() + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70)) + theme(axis.title.x=element_blank())
  
  q4_p <- ggplot(to_plot4, aes(x = V1, y=AVE)) + geom_line() + ylab("% of forks moving L->R") + theme_minimal() + coord_cartesian(xlim = c(-50, 50), ylim=c(30,70)) + theme(axis.title.x=element_blank())
  
  #combine to one figure and return
  the_plot <- plot_grid(q4_p, q3_p, q2_p, q1_p, labels = c("Q4", "Q3", "Q2", "Q1"), nrow=4, ncol=1, label_x=c(0.15,0.15,0.15,0.15), label_y=0.95)
  return(the_plot)
}

# (1) s.b. plots, fpkm > med, by gene length
gene_table_filtered <- gene_table[!is.na(gene_table$fpkm),]

# Make a new column for gene length
gene_table_filtered[,"gene_len"] = apply(gene_table_filtered[,c("hg19.knownGene.txStart","hg19.knownGene.txEnd")], 1, diff) 
len_col_pos = ncol(gene_table_filtered) # the last column is gene length

# Filter for FPKM > median
fpkm_med = median(gene_table_filtered$fpkm)
gene_table_txn <- gene_table_filtered[gene_table_filtered$fpkm > fpkm_med,]

# Split by length and make plots
pg1 <- make_sb_plots_by_quartile(gene_table_txn, "TSS", len_col_pos) 

# Add the title
title <- ggdraw() + draw_label("Length, FPKM > median", fontface = 'bold', x = 0, hjust = 0) + theme(plot.margin = margin(0, 0, 0, 7))
pg2 <- plot_grid(title, pg1, ncol = 1, rel_heights = c(0.1, 1))

# Save
ggsave(file="/Users/sarahkeegan/Dropbox/mac_files/RClass_Summer2020/images/TSS_by_Length.png", plot=pg2, width=2.75, height=8, dpi=300)
  
# (2) s.b. plots, by txn volume (fpkm x length)

# Make a new column for gene volume
gene_table_filtered[,"gene_vol"] = apply(gene_table_filtered[,c("fpkm","gene_len")], 1, prod) 
vol_col_pos = ncol(gene_table_filtered) # the last column is gene volume

# Split by volume and make plots
pg1 <- make_sb_plots_by_quartile(gene_table_filtered, "TSS", vol_col_pos) 

# Add the title
title <- ggdraw() + draw_label("Txn Volume", fontface = 'bold', x = 0, hjust = 0) + theme(plot.margin = margin(0, 0, 0, 7))
pg2 <- plot_grid(title, pg1, ncol = 1, rel_heights = c(0.1, 1))


# Save
ggsave(file="/Users/sarahkeegan/Dropbox/mac_files/RClass_Summer2020/images/TSS_by_Volume.png", plot=pg2, width=2.75, height=8, dpi=300)

Heatmaps showing strand bias for each gene

The same data can also be visualized as a heat map - instead of averaging over all genes, I can show the s.b. data for each gene as a heat map. Here, I am showing all transcribed genes (FPKM > median) sorted by length, around TSS.

I initialize a matrix to hold the strand bias data, then use a for loop to fill in the data for each row.

    library(ComplexHeatmap)
## Loading required package: grid
## ========================================
## ComplexHeatmap version 2.4.3
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
## 
## If you use it in published research, please cite:
## Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
##   genomic data. Bioinformatics 2016.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
    w_st=32
    w_end=232
    c_st=233
    c_end=433
    
    w_genes=gene_table_txn[gene_table_txn$hg19.knownGene.strand == "+",]
    w_genes_w_data = as.matrix(w_genes[,c(w_st:w_end,len_col_pos)])
    w_genes_c_data = as.matrix(w_genes[,c(c_st:c_end,len_col_pos)])
    
    c_genes=gene_table_txn[gene_table_txn$hg19.knownGene.strand == "-",]
    c_genes_w_data = as.matrix(c_genes[,c(w_st:w_end,len_col_pos)])
    c_genes_c_data = as.matrix(c_genes[,c(c_st:c_end,len_col_pos)])
    
    # Matrix initialization - 202 columns and a row for each of the genes
    w_sb_data=matrix(rep(0, 202*nrow(w_genes)), nrow = nrow(w_genes))
    c_sb_data=matrix(rep(0, 202*nrow(c_genes)), nrow = nrow(c_genes))
    
    # Loop to fill in the sb data matrices
    for(i in 1:201)
    {
        w_sb_data[,i] <- w_genes_c_data[,i] / (w_genes_w_data[,i] + w_genes_c_data[,i])
        c_sb_data[,201-i+1] <- c_genes_w_data[,i] / (c_genes_w_data[,i] + c_genes_c_data[,i]) # reverse/flip crick
        
        #add in the length to sort
        w_sb_data[,202] <- w_genes_c_data[,202]
        c_sb_data[,202] <- c_genes_c_data[,202]
    }
    
    # Combine the matrices 
    full_matrix <- rbind(w_sb_data, c_sb_data)
    
    # Sort by length
    #order(df[,1],df[,2],decreasing=TRUE)
    full_matrix_sorted <- full_matrix[order(full_matrix[,202], decreasing=TRUE),]
    
    # Make heat map
    Heatmap(full_matrix_sorted, cluster_columns = FALSE, cluster_rows = FALSE)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` arugment by explicitly setting
## TRUE/FALSE to it. Set `ht_opt$message = FALSE` to turn off this
## message.

What about TTS sites? How does that look?

Here I make a s.b. plot around the TTS and can see enrichment for replication termination localized at the TTS. It is also dependent on transcription level. Quartiles shown are for transcription volume.

# Split by volume and make plots
to_save <- make_sb_plots_by_quartile(gene_table_filtered, "TTS", vol_col_pos) 

# Save
ggsave(file="/Users/sarahkeegan/Dropbox/mac_files/RClass_Summer2020/images/TTS_by_Volume.png", plot=to_save, width=2.75, height=8, dpi=300)

TTS by Txn Volume

Discussion

The data here suggests that DNA replication initiation is coupled to transcription. Origin firing occurs preferentially immediately upstream of the TSS sites of genes with high transcriptional volume (high RNAP2 occupancy).

When replication is oriented in this way, it is traveling in the same direction as transcription, minimizing head-on collisions between the replication and transcription machinery, which can be deleterious (DNA damage etc).

We also observe termination events are highly enriched at gene 3’ ends (TTS), possibly due to the replisome colliding with paused RNAP2 leading to fork termination. These forks can be ‘rescued’ by increased replication initiation downstream of the TTS. (RNAP2 and the replication fork normally move at approximately the same speed in eukaryotes.)

Additional questions….

Ok-Seq for other cell types (Cancer)

Examination of replication dynamics at certain classes of genes such as very long genes, genes containing fragile sites

Look at replication around other genomic loci, e.g. Enhancers, etc

What happens when the system is perturbed, e.g. inducing replication stress, protein KOs, etc.